Allow name_only option gensim downloader api #2143

Merged
merged 35 commits into from
Aug 3, 2018

@aneesh-joshi aneesh-joshi commented Jul 31, 2018

Currently, to get the exact names of the models or corpora, a user has to either:

  1. run gensim.info() and look through the huge JSON dump to find the exact names
  2. go to the gensim-data website and check

When using gensim-data, I often forget the exact key.
"Was it 'glove-wiki-gigaword' or 'glove-gigaword-wiki'?"

It would be very helpful if a user could, in the terminal or otherwise, type:
gensim.info(name_only=True) or
python -m gensim.downloader --info_name_only
and get the following output:

{
    "corpora": [
        "semeval-2016-2017-task3-subtaskBC",
        "semeval-2016-2017-task3-subtaskA-unannotated",
        "patent-2017",
        "quora-duplicate-questions",
        "wiki-english-20171001",
        "text8",
        "fake-news",
        "20-newsgroups",
        "__testing_matrix-synopsis",
        "__testing_multipart-matrix-synopsis"
    ],
    "models": [
        "fasttext-wiki-news-subwords-300",
        "conceptnet-numberbatch-17-06-300",
        "word2vec-ruscorpora-300",
        "word2vec-google-news-300",
        "glove-wiki-gigaword-50",
        "glove-wiki-gigaword-100",
        "glove-wiki-gigaword-200",
        "glove-wiki-gigaword-300",
        "glove-twitter-25",
        "glove-twitter-50",
        "glove-twitter-100",
        "glove-twitter-200",
        "__testing_word2vec-matrix-synopsis"
    ]
}
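
For readers who want to see the idea in code, here is a minimal sketch of the reduction a name-only mode performs. It assumes the full info dump is a mapping with "corpora" and "models" keys, each mapping a dataset name to its metadata; `names_only` and the truncated `full_info` sample are hypothetical illustrations, not the code in this PR.

```python
import json


def names_only(information):
    """Keep only the dataset names from a full info mapping (hypothetical helper)."""
    return {
        # Iterating a dict yields its keys, i.e. the dataset names.
        "corpora": list(information["corpora"]),
        "models": list(information["models"]),
    }


# Hypothetical, heavily truncated stand-in for the full info dump;
# the metadata values are placeholders, not real gensim-data fields.
full_info = {
    "corpora": {"text8": {"meta": "..."}, "fake-news": {"meta": "..."}},
    "models": {"glove-twitter-25": {"meta": "..."}},
}

print(json.dumps(names_only(full_info), indent=4))
```

The result has the same two-key shape as the output above, with each value reduced to a list of names.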

Notes:
The doctests in downloader.py already fail on the current develop branch, independent of my changes.

@@ -29,6 +29,7 @@
Also, this API available via CLI::

python -m gensim.downloader --info <dataname> # same as api.info(dataname)
python -m gensim.downloader --info_name_only # same as api.info(name_only=True)

I think it's better to do this as a parameter of the --info flag (instead of a new --info_* flag), like --info name

@aneesh-joshi

changes made @menshikh-iv

@@ -29,6 +29,7 @@
Also, this API available via CLI::

python -m gensim.downloader --info <dataname> # same as api.info(dataname)
python -m gensim.downloader --info name_only # same as api.info(name_only=True)

--info name please :) (but keep the name_only parameter for CLI)
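
One way the suggested `--info name` form could be wired up with argparse is sketched below. This is an assumption-laden illustration, not necessarily what the merged commit does: `FULL_INFO`, `build_parser`, and `describe_request` are hypothetical names, and the function returns a description of the call instead of invoking the real downloader.

```python
import argparse

# Sentinel meaning "--info was given without a value" (hypothetical name).
FULL_INFO = object()


def build_parser():
    parser = argparse.ArgumentParser(prog="downloader-sketch")
    # nargs="?" makes the value optional; const is used when it is omitted.
    parser.add_argument(
        "-i", "--info", nargs="?", const=FULL_INFO, default=None,
        metavar="data_name",
    )
    return parser


def describe_request(argv):
    """Return which info call the given command-line arguments would map to."""
    args = build_parser().parse_args(argv)
    if args.info is None:
        return "nothing requested"
    if args.info is FULL_INFO:
        return "info()"
    if args.info == "name":
        # The special value "name" selects the name-only listing.
        return "info(name_only=True)"
    return "info({!r})".format(args.info)


print(describe_request(["--info", "name"]))
```

A trade-off of reusing the `--info` value this way is that a dataset literally called `name` could no longer be queried through the flag.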

@menshikh-iv

@aneesh-joshi thanks!

@menshikh-iv menshikh-iv merged commit 4520adf into piskvorky:develop Aug 3, 2018
@aneesh-joshi aneesh-joshi deleted the name_only_develop branch August 3, 2018 06:30